ABSTRACT


Computer networks link many activities, events, and applications.
A network’s performance and capacity must grow to handle an
increasing number of users, and its computer systems should
guarantee security, confidentiality, and integrity. An intrusion
jeopardizes the operation and security of a wired or wireless network
system. If intrusions are not detected at the appropriate level, the
loss to the system might be immeasurable. Intrusions occur when
malicious actors harm information resources. Hackers tamper with
normal operations or attempt to infiltrate the system via the gateway.
The study analyzes the attack and normal traffic packets from the
KDD Cup99 dataset. The KDD Cup99 data includes benchmark
traffic and intrusion detection features. However, most intrusion
detection systems today have significant false alarm rates and miss
many attacks because they cannot distinguish between lawful and
unlawful behaviors.


KEYWORDS: Invasion, Intrusion Detection, KDDCup99, Misuse 
detection, Anomaly detection 


I. INTRODUCTION 

Daily life depends on the quick availability and
processing of information. If demand increased in this
scenario, it would be necessary to store
proportionately more data and resources across
numerous computers with the necessary correlation,
and data interference, unauthorized access, and system
and network growth would worsen. The virtual access
path would grant access to unauthorized network users.
On the other side, hackers can access confidential data
by taking advantage of flaws in networks or systems.
The constraints on access and security measures are
insufficient against internal and compromised threats.


Recognizing breaches and intrusions is the only proven
approach to keeping systems and networks safe. Along
with identifying real attackers, intrusion detection
systems should also keep track of attempted
intrusions.


A trustworthy system should secure its data and
resources from unauthorized access, tampering, and
denial of service attacks. Any computer network
system should operate with some expected level of
trust and confidence. The security policy must be
formulated for every system based on future
performance. Computer security is typically based on
realizing the following factors in a computer system.


> Confidentiality: information is to be accessed
only by authorized persons.


> Integrity: information must remain unaltered by
mischievous or malicious attempts.


> Availability: the computer must function without
degradation of access and provide resources to
legitimate users when required.


In general, an intrusion is any action attempting to
compromise a resource’s confidentiality, integrity, or
availability. Anderson (1980) defined intrusion as the
potential opportunity of an intentionally unauthorized
attempt to access information, manipulate
information, or make a system untrustworthy.


The Intrusion Detection System (IDS) was commercially
introduced in the year 1990 [1]. It behaves like a
burglar alarm that detects an invasion and triggers
audible or visual alarms or messages such as e-mail.
The IDS is used to prevent problem behaviors that attack
or abuse the system and to detect and deal with attacks.
The mechanism should have low false alarms while
ensuring invasion detection. Various approaches are
present, but they are relatively ineffective in the
classification and alarm rate dimensions. Machine
learning-based anomaly detection approaches have
been effectively used in the network intrusion
detection scenario because of their intrinsic
capability of discovering new attacks [2]. Most
existing classification methods are based on neural
networks, fuzzy logic, genetic algorithms, and support
vector machines.


The motivation for this work is to study and analyze 
various attacks and explore benchmark datasets for 
designing the enhanced methodology. The paper’s 
objective is to study the existing available methods to 
explore the possibilities of improved performance. 
The rest of the paper is organized as follows: Section 
1 presents the Introduction to IDS, Section 2 presents 
the surveys of significant work carried out in the 
domain of IDS, Section 3 describes the analysis of 
data applicable to IDS, and finally, the conclusion is 
presented for the entire work. 


II. RESEARCH BACKGROUND

Detection of intrusions protects a computer network
from unauthorized users as well as insider attacks.
The intrusion detector’s task is to construct a predictive
model or classification method capable of
distinguishing between ‘bad’ connections, called
intrusions or attacks, and ‘good’ or normal connections.
IDSs are broadly classified into three categories based
on deployment.


> Network-based IDS (NIDS): It is a passive
device that resides in an organization’s computer
or network and observes the network traffic to
indicate attacks. It recognizes an attack and
immediately notifies system administrators of the
malicious activity. It can be installed at
the boundary router to observe the traffic
going into and out of the network [3]. A
minimum number of monitoring units can be
deployed for an extensive network without
disturbing the regular operations of the network. It is
also not vulnerable to direct attack, but it can
become overwhelmed by network traffic, cannot
inspect encrypted packets, and fails to distinguish
some attacks.


> Host-based IDS (HIDS): It resides in the
computer or server, called the host, and examines
only the host activities. It is employed to monitor
the system and stored configuration files and
detect the intruders’ creation, modification, and
deletion of system files. It can also detect local
events and attacks that the NIDS has not detected.
The configuration of a HIDS resides only on an
individual host and requires more management
effort to install and configure on multiple hosts.
Also, HIDSs are more vulnerable to direct attacks
and susceptible to some Denial of Service (DoS)
attacks [4].


> Application-based IDS (AppIDS): It is an
enhancement of the HIDS, which examines an
application for abnormal events by looking into
the files created by the application and anomalous
events such as users exceeding their
authorization and invalid file execution. It also
observes the interaction between the application
and the user and the encrypted traffic. It is more
susceptible to attack and is not able
to detect software tampering [5].


The accuracy of any IDS is measured based on the false
alarm rate (both positive and negative). Based on the
detection method, IDSs are classified into:

> Misuse Detection: In a misuse detection or
signature-based intrusion detection system, the
signatures or patterns of known attacks are
placed in a database. They are matched against the
signatures of traffic entering the network. A known
attack can be detected accurately from its
signature. Unfortunately, newly formed attacks
with modified signatures can go undetected
within the system and are classified as false
negatives [6]. In general, false negatives are
more associated with signature-based IDS. It is
also referred to as knowledge-based IDS.


> Anomaly Detection: The anomaly detection or
statistical anomaly-based IDS gathers statistical
summaries by watching traffic that is
known to be normal, and a performance baseline
is developed. The network activities are
periodically monitored and compared with this
baseline. The statistical and
behavioral patterns that detect attacks allow a low
false negative rate. The behavioral patterns of
users or programs are used to develop a pattern of
normal and abnormal activities, which is used to
detect the occurrence of an attack. Consequently,
any variation from typical behavior by a user or
program would be detected, thereby generating an
alarm. Regrettably, most alarms are benign, and
false positives result. It is also
referred to as behavior-based IDS [7].


The fundamental principle of anomaly intrusion
detection is that any intrusive activity is a subset of
anomalous activity. The intrusion may be recognized
based on anomalous actions. For example, suppose an
authorized employee of an organization opens the
system after office hours using their official account.
In that case, it is considered abnormal, and
consequently, it may be an intrusion. Likewise, users
who constantly log in through the official server
outside working hours are also treated as an
anomaly. An intrusive activity can also be carried out as
a sum of individual activities, none of which is
anomalous on its own. Flagging each anomalous
activity individually therefore results in false positives
or false negatives. Moreover, intrusive activity does not
always coincide with anomalous activity. There
are four possibilities (Sangeetha et al., 2022):
> Intrusive but not Anomalous: These are
false negatives or Type II errors, in which the
activity is intrusive but goes undetected because it is
not anomalous. They are called false negatives because
the IDS falsely reports the absence of intrusions.
> Not Intrusive but Anomalous: These are
false positives or Type I errors, in which the
activity is not intrusive but is treated as intrusive
because it is anomalous. They are called false
positives because the IDS falsely reports
intrusions.
> Not Intrusive and not Anomalous: These are
true negatives, in which the activity is not
intrusive and is not reported as intrusive.
> Intrusive and Anomalous: These are true
positives, in which the activity is intrusive and
reported as intrusive because it is also anomalous.
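
The four outcomes above can be tallied from paired ground-truth and detector decisions. The following is a minimal sketch, not part of the surveyed systems; the function names `confusion_counts` and `rates` and the toy data are illustrative assumptions. It derives the detection rate and the false alarm rate from the counts.

```python
from collections import Counter


def confusion_counts(is_intrusive, is_flagged):
    """Count TP, FN, FP, TN for paired ground truth and detector output."""
    counts = Counter({"TP": 0, "FN": 0, "FP": 0, "TN": 0})
    for intrusive, flagged in zip(is_intrusive, is_flagged):
        if intrusive and flagged:
            counts["TP"] += 1   # intrusive and anomalous
        elif intrusive:
            counts["FN"] += 1   # intrusive but not anomalous
        elif flagged:
            counts["FP"] += 1   # not intrusive but anomalous
        else:
            counts["TN"] += 1   # not intrusive and not anomalous
    return counts


def rates(counts):
    """Return (detection rate, false alarm rate) from the four counts."""
    attacks = counts["TP"] + counts["FN"]
    normals = counts["FP"] + counts["TN"]
    detection_rate = counts["TP"] / attacks if attacks else 0.0
    false_alarm_rate = counts["FP"] / normals if normals else 0.0
    return detection_rate, false_alarm_rate


# Toy data (hypothetical): 1 = intrusive / flagged, 0 = normal / not flagged.
truth = [1, 1, 0, 0, 0, 1]
flagged = [1, 0, 1, 0, 0, 1]
counts = confusion_counts(truth, flagged)
detection_rate, false_alarm_rate = rates(counts)
```

A detector is usually tuned to raise the detection rate while keeping the false alarm rate low; the two rates are reported separately because they are computed over different populations (attacks and normal traffic).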


In an anomaly detection system, the activities of
various subjects are observed, and profiles, called
master profiles, are generated based on their behaviors.
If behavior changes in a later period,
the profile measures are updated periodically.
The current activities are stored in temporary profiles
and periodically transferred to a master profile. In
statistical intrusion detection systems, the system is
trained regularly on the acquired user activities using
behavioral moments, which are used to distinguish
patterns as normal or abnormal. The advantage of anomaly
intrusion detection is that a data point of a specific
feature that lies more than a chosen multiple of the
standard deviation away from the mean, on either side,
may be flagged as anomalous. The disadvantage is that
anomaly intrusion detection is not sensitive to the
order in which events occur. It will probably miss
intrusions that are indicated by sequential
interrelationships among events. Moreover, fixing the
threshold value of the deviation is challenging: a
low threshold results in false positives, and
a high threshold results in false negatives.
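
The standard-deviation rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method of any cited system: the baseline is a single feature observed on traffic known to be normal, and the names `fit_baseline`, `is_anomalous`, and the multiplier `k` are hypothetical.

```python
import statistics


def fit_baseline(normal_values):
    """Summarize known-normal observations of one feature as (mean, std)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomalous(value, mean, std, k=3.0):
    """Flag a value lying more than k standard deviations from the mean."""
    if std == 0:
        return value != mean
    return abs(value - mean) > k * std


# Hypothetical baseline: connections per second seen during normal operation.
baseline = [10, 12, 11, 9, 10, 11, 12, 10]
mean, std = fit_baseline(baseline)

burst = is_anomalous(30, mean, std)            # far outside the baseline
strict = is_anomalous(12, mean, std, k=1.0)    # low threshold flags a normal value
relaxed = is_anomalous(12, mean, std, k=3.0)   # higher threshold accepts it
```

The value 12 appears in the normal baseline itself, so flagging it with `k=1.0` is a false positive; raising `k` removes that alarm but also widens the band in which a real intrusion could hide.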


False positives are a major challenge for intrusion
detection systems. Anomaly detection systems are
especially prone to false positives. Generally, no
significant rate of false positives is reported in
signature-based systems if the rules are correctly installed.
Likewise, false negatives are also a problem for IDS.
In misuse-based systems, false negatives may arise
when attack traffic resembles typical data.
The techniques for detecting intruders have
evolved to face new attacks. For simplicity, the
normal and attack packets are indicated by ‘0’ and
‘1’, respectively. Table 1 presents various works in
terms of security systems and feature selection. The
Hybrid Association Classification (AC) approach, a
hybrid classification methodology, was introduced by
Hadi et al. (2018). Several rules are developed to
reflect each attribute, and the number of categorization
rules is kept to a minimum. The Two-hidden-layer
Extreme Learning Machine (TELM) was suggested by Qu et al.
(2016) to tackle challenging classification and
regression problems with little storage. When a neural
network has a lot of hidden layers, TELM significantly
improves performance. Nabipour et al. (2020)
proposed a classification approach for high-dimensional
situations. The genetic algorithm supports
the fuzzy rule-based methodology used to create the
classification model. The guidelines for choosing the
best features were predicted using the Mixed Integer
Programming Model.


Table 1. Chronological Literature Review


| Research | Technique Used | Methodology | Advantages/Disadvantages |
|---|---|---|---|
| Nadiammai et al. (2014) | Intrusion Detection System (IDS) with Data Mining; Efficient Data Adapted Decision Tree (EDADT) | Identify the relevant data; classify the Distributed Denial of Service (DDoS) attack using labeled data | Efficient; High Detection Rate; High Accuracy |
| Shakshuki et al. (2012) | IDS for MANET | Enhanced Adaptive Acknowledgment (EAACK); classify the malicious behavior | High Detection Rate; Low False Alarm Rate |
| Bhatia et al. (2017) | IDS with Artificial Neural Network (ANN) | Classify the training data; compare the oversampling of the U2R and R2L; categorize the attacks | Better Detection Rate; High Accuracy |
| Yahalom et al. (2019) | Intrusion Detection System for Hierarchical Data | MIL-STD protocol is used for the training of data; reduce the false alarm rate | High Detection Rate; Efficiency; High Accuracy |
| Liu & Lang (2019) | Intrusion Detection by fusion of different feature selection algorithms | Feature selection is performed using the linear correlation coefficient and Cuttlefish algorithms | High Detection Rate; Low False Alarm Rate; High Accuracy |


One of the best approaches to solving the multi-class problem in machine learning is to use a classification system
based on fuzzy rules. The Cluster Center and Nearest Neighbor feature selection method was proposed by Lin et al.
(2015). It computes the distance between each data sample and its own cluster’s center, and then applies the same
distance function to the sample and its nearest neighbor in the cluster.


Then, using the k-NN classifier, which has a high processing efficiency and detection rate, each piece of data
may be utilized in the intrusion detection process. Composition of Feature Relevancy is a novel feature selection
method proposed by Longde et al. (2018). Eight real-world datasets and two different classifiers are used
to enhance feature selection. Liu et al. (2017) proposed a technique for selecting attributes based on aptitude.
After identifying the closest traits, the quality is determined. These methods result in superior feature selection
outcomes. Basu (2019) introduced a new data structure called a Grid Count Tree (GCD) to find outliers. It
may be used to compute numerical value separation and category separation quickly and to separate meaningful
signals from false data. Both real-world and artificial genetically connected applications are used to evaluate this
GCD. Cai (2013) introduced the Iterative Self-organizing Map with Robust Distance (ISOMRD) for outlier
detection. Clusters are created when points with similar traits assemble. Many databases
are processed iteratively. The method is helpful for dynamic analysis and geographical data
mining applications. Bai et al. (2016) proposed an outlier detection technique based on the local outlier factor (LOF)
for large data sets. Outliers are identified using the Grid-Based Partition Algorithm and the Distributed LOF. The
data collection is divided into a small number of grid sets, and data nodes are assigned. Tuples are classified
as cross-grid tuples or grid-local tuples. The Distributed LOF is utilized effectively in distributed
situations to reduce outliers. Di Mauro et al. (2021) suggested a feature selection technique for two categories of
data sets. These data sets aid in the detection of false negatives and improve forecast accuracy. Idris (2014)
suggested a negative feature selection technique for detecting e-mail spam. NSA-PSO defines a local outlier
factor to estimate the threshold value. The proposed method outperforms non-FSA techniques. Fouladvand et al.
(2017) introduced the Distribution Estimation based Negative Selection Algorithm (DENSA), a novel
attribute selection approach for normal and self-space using detectors. Random detectors performed
well on a range of real-world data sets in this experiment.


Various applications exist for the IDS to detect different types of attacks and security violations. It also protects
applications such as business transaction systems, document maintenance systems, banking, insurance
systems, and e-governance from adversaries. The applications of IDS are not limited to specific domains because all
sensitive services are available on the Internet and intranets. Service providers need to safeguard valuable information
consistently. Technologies are growing exponentially, and protecting resources is becoming more complex.
The system framework and the critical elements of the research model are covered in the following section.


III. THEORETICAL FRAMEWORK

Real-world data must be generated for intrusion detection to evaluate all potential risks. The stages involved in
data analysis methodically identify patterns in the gathered data and link them to the problem that has been
recognized. Data modeling determines how the data may be categorized and connected. The accuracy and
reliability of the data collected for the evaluation are aspects of data quality. Figure 1 presents the theoretical
framework for KDD Cup dataset analysis. The investigation is feasible only if the data quality and attributes
are suitable for the problem.

Figure 1. Theoretical Framework for KDD Cup Dataset Analysis


Research in this area requires ground-truth datasets. In this paper, KDD Cup99 is used for intrusion
detection systems to classify network traffic. The dataset comes from the KDD Cup competition of the ACM Special
Interest Group on Knowledge Discovery and Data Mining (http://www.sigkdd.org/kddcup) (Nguyen et al., 2016). The Lincoln
Laboratory at Massachusetts Institute of Technology produced standard network traffic data under the auspices of
DARPA and the Air Force Research Laboratory to evaluate computer network intrusion detection. The research activity
mainly focuses on the 1998 and 1999 datasets. Figure 2 presents the KddCup99 dataset description.

Figure 2. KddCup99 Dataset Description


The dataset is a standardized set of auditable data containing a variety of intrusions simulated in a military
network environment. It emulates a US Air Force LAN and focuses mainly on real-environment attacks. The raw
TCP/IP dump data was captured from the network. A connection is a sequence of TCP packets between a source IP
address and a destination IP address that starts and stops at certain times, during which data is transferred under
a specific protocol. Furthermore, each connection carries a label that marks it as normal or as an attack of a
specific type.


A. Attributes in KddCup99 

The features are grouped into three categories: (i) basic features of individual connections, (ii) content features
within a connection, and (iii) traffic features, which are computed using a two-second time window. Each KDD Cup99
connection record has a total of 41 features. A connection’s
basic features are features 1-9, content features are features 10-22,
time-based traffic features are features 23-31, and host-based traffic features are features 32-41. Two terms
associated with the data set are (i) ‘same host’, which refers to connections established within the previous two
seconds with the same host as the current connection, and (ii)
‘same service’, which refers to connections within the last two seconds that used the same service as the current
connection. The features based on ‘same host’ and ‘same service’ are collectively referred to as the
time-based traffic features of the connection records. Table 2 presents the various attributes of the KddCup99 dataset.
Table 2. KddCup99 Attribute Description


| Feature Name | Variable Type* | Category** | Label | Description |
|---|---|---|---|---|
| duration | C | 1 | v1 | Connection length in seconds |
| protocol_type | D | 1 | v2 | Type of protocol (TCP, UDP, etc.) |
| service | D | 1 | v3 | Network service (HTTP, telnet, etc.) |
| flag | D | 1 | v4 | Normal or error connection status |
| src_bytes | C | 1 | v5 | Data bytes from source to destination |
| dst_bytes | C | 1 | v6 | Data bytes from destination to source |
| land | D | 1 | v7 | 1 - connection is from/to the same host/port; 0 - otherwise |
| wrong_fragment | C | 1 | v8 | Number of ‘wrong’ fragments |
| urgent | C | 1 | v9 | Number of urgent packets |
| hot | C | 2 | v10 | Count of system accesses |
| num_failed_logins | C | 2 | v11 | Number of failed login attempts |
| logged_in | D | 2 | v12 | 1 - successfully logged in; 0 - otherwise |
| num_compromised | C | 2 | v13 | Number of compromised conditions |
| root_shell | C | 2 | v14 | 1 - root shell is obtained; 0 - otherwise |
| su_attempted | C | 2 | v15 | 1 - ‘su root’ attempted; 0 - otherwise |
| num_root | C | 2 | v16 | Number of ‘root’ accesses |
| num_file_creations | C | 2 | v17 | Number of file creation operations |
| num_shells | C | 2 | v18 | Number of shell prompts |
| num_access_files | C | 2 | v19 | Number of write, delete, and create operations |
| num_outbound_cmds | C | 2 | v20 | Number of outbound commands in FTP |
| is_hot_login | D | 2 | v21 | 1 - login is on the ‘hot’ list (root, adm, etc.); 0 - otherwise |
| is_guest_login | D | 2 | v22 | 1 - guest login (guest, anonymous, etc.); 0 - otherwise |
| count | C | 3 | v23 | Connections to the same host |
| srv_count | C | 3 | v24 | Connections to the same service |
| serror_rate | C | 3 | v25 | ‘SYN’ errors to the same host |
| srv_serror_rate | C | 3 | v26 | ‘SYN’ errors to the same service |
| rerror_rate | C | 3 | v27 | ‘REJ’ errors to the same host |
| srv_rerror_rate | C | 3 | v28 | ‘REJ’ errors to the same service |
| same_srv_rate | C | 3 | v29 | Same service and the same host |
| diff_srv_rate | C | 3 | v30 | Different services and the same host |
| srv_diff_host_rate | C | 3 | v31 | Same service and different hosts |
| dst_host_count | C | 3 | v32 | Same host to the destination host |
| dst_host_srv_count | C | 3 | v33 | Same service to the destination host as the current connection |
| dst_host_same_srv_rate | C | 3 | v34 | Same service to the destination host |
| dst_host_diff_srv_rate | C | 3 | v35 | Different services to the destination host |
| dst_host_same_src_port_rate | C | 3 | v36 | Port services to the destination host |
| dst_host_srv_diff_host_rate | C | 3 | v37 | Different hosts from the same service to the destination host |
| dst_host_serror_rate | C | 3 | v38 | ‘SYN’ errors (same host to destination) |
| dst_host_srv_serror_rate | C | 3 | v39 | ‘SYN’ errors from the same service to the destination host |
| dst_host_rerror_rate | C | 3 | v40 | ‘REJ’ errors (same host to destination) |
| dst_host_srv_rerror_rate | C | 3 | v41 | ‘REJ’ errors (same service to destination) |


* C - Continuous, D - Discrete. ** 1 - Intrinsic, 2 - Content, 3 - Traffic.
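
The two-second ‘same host’ and ‘same service’ windows behind the count and srv_count features can be illustrated with a short sketch. It assumes a time-ordered list of connections, and it counts the current connection itself; both are assumptions made for illustration, as are the `Connection` tuple and the `window_counts` function, which are not part of the dataset tooling.

```python
from collections import namedtuple

# Hypothetical minimal connection record: start time (s), destination host, service.
Connection = namedtuple("Connection", "time dst_host service")


def window_counts(connections, index, window=2.0):
    """Return (count, srv_count) for connections[index].

    count     -- connections to the same destination host in the last `window` seconds
    srv_count -- connections to the same service in the last `window` seconds
    The list is assumed to be sorted by time; the current connection is included.
    """
    current = connections[index]
    count = srv_count = 0
    for earlier in connections[: index + 1]:
        if current.time - earlier.time <= window:
            if earlier.dst_host == current.dst_host:
                count += 1
            if earlier.service == current.service:
                srv_count += 1
    return count, srv_count


trace = [
    Connection(0.0, "A", "http"),
    Connection(0.5, "A", "ftp"),
    Connection(1.0, "B", "http"),
    Connection(2.4, "A", "http"),
]
last = window_counts(trace, 3)
```

For the last connection, the one at 0.0 s falls outside the window, so one earlier connection shares its host and one shares its service. A flood of connections to one host within two seconds drives count up sharply, which is why these features are informative for DoS and probing traffic.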


The features protocol_type, service, flag, land, logged_in, is_hot_login, and is_guest_login are labeled as discrete or
categorical features, and the other 34 features are labeled as continuous features. Table 3 presents the description
of the various flag values of KddCup99, and the values of the categorical feature service are listed in Table 4.


Table 3. Description of flag values


| Flag | Label | Description |
|---|---|---|
| RSTOS0 | 1 | The originator sent a SYN followed by an RST but never saw a SYN-ACK from the responder |
| RSTR | 2 | Established, responder aborted |
| RSTO | 3 | Connection established; originator aborted (sent an RST) |
| OTH | 4 | No SYN seen, just midstream traffic (a “partial connection” that was not later closed) |
| REJ | 5 | Connection attempt rejected |
| S0 | 6 | A connection attempt was seen, but no reply |
| S1 | 7 | Connection established, not terminated |
| S2 | 8 | Connection established and the close attempt by originator seen (but no reply from responder) |
| S3 | 9 | Connection established and the close attempt by responder seen (but no reply from originator) |
| SF | 10 | Normal establishment and termination |
| SH | 11 | The originator sent a SYN followed by a FIN (finish ‘flag’) but never saw a SYN-ACK from the responder (hence the connection was “half” open) |


Table 4. Various services and flags in the KddCup99 dataset


| Label | Service | Label | Service | Label | Service |
|---|---|---|---|---|---|
| 1 | netbios_dgm | 25 | Z39_50 | 49 | time |
| 2 | netbios_ssn | 26 | gopher | 50 | echo |
| 3 | netbios_ns | 27 | domain | 51 | ldap |
| 4 | remote_job | 28 | finger | 52 | link |
| 5 | http_8001 | 29 | klogin | 53 | HTTP |
| 6 | hostnames | 30 | kshell | 54 | SMTP |
| 7 | uucp_path | 31 | supdup | 55 | UUCP |
| 8 | http_2784 | 32 | systat | 56 | auth |
| 9 | iso_tsap | 33 | telnet | 57 | nnsp |
| 10 | csnet_ns | 34 | shell | 58 | nntp |
| 11 | domain_u | 35 | imap4 | 59 | name |
| 12 | ftp_data | 36 | eco_i | 60 | exec |
| 13 | http_443 | 37 | ecr_i | 61 | AOL |
| 14 | daytime | 38 | red_i | 62 | IRC |
| 15 | harvest | 39 | pop_2 | 63 | X11 |
| 16 | discard | 40 | pop_3 | 64 | BGP |
| 17 | netstat | 41 | login | 65 | CTF |
| 18 | courier | 42 | tim_i | 66 | MTP |
| 19 | pm_dump | 43 | urh_i | 67 | rje |
| 20 | printer | 44 | urp_i | 68 | ssh |
| 21 | private | 45 | ntp_u | 69 | efs |
| 22 | sql_net | 46 | vmnet | 70 | ftp |
| 23 | tftp_u | 47 | other | | |
| 24 | sunrpc | 48 | whois | | |
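
Before a classifier can use them, the categorical values in Tables 3 and 4 are typically replaced by their integer labels. The sketch below shows one way to do this. The flag labels follow Table 3 and the service labels are a small excerpt of Table 4; the protocol labels are an assumption, since the paper does not list them, and the function name `encode_categorical` is hypothetical.

```python
# Flag labels as listed in Table 3.
FLAG_LABELS = {
    "RSTOS0": 1, "RSTR": 2, "RSTO": 3, "OTH": 4, "REJ": 5, "S0": 6,
    "S1": 7, "S2": 8, "S3": 9, "SF": 10, "SH": 11,
}

# Excerpt of the service labels in Table 4 (keys lowercased for lookup).
SERVICE_LABELS = {"private": 21, "telnet": 33, "http": 53, "smtp": 54, "ftp": 70}

# Assumption: the paper gives no labels for protocol_type; these are illustrative.
PROTOCOL_LABELS = {"tcp": 1, "udp": 2, "icmp": 3}


def encode_categorical(record):
    """Return a copy of `record` with its categorical fields mapped to integers."""
    encoded = dict(record)
    encoded["protocol_type"] = PROTOCOL_LABELS[record["protocol_type"].lower()]
    encoded["service"] = SERVICE_LABELS[record["service"].lower()]
    encoded["flag"] = FLAG_LABELS[record["flag"].upper()]
    return encoded


sample = {"protocol_type": "TCP", "service": "HTTP", "flag": "SF", "src_bytes": 327}
encoded = encode_categorical(sample)
```

Integer labels impose an arbitrary order on the categories, which some classifiers treat as meaningful; one-hot encoding is a common alternative when that ordering would distort distance-based methods such as k-NN.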


B. Classification of Attacks 
A variety of attacks enter the network over time, and they are classified into
the following four main classes:


> Denial of Service (DoS): It is a class of attacks where an attacker makes some computing or memory resource too
busy or too full to handle legitimate requests, denying legitimate users access to a machine. The three
different ways to launch a DoS attack are (i) by abusing the computer’s legitimate features, (ii) by targeting
implementation bugs, and (iii) by exploiting the misconfiguration of systems. DoS attacks are
classified based on the services an attacker renders unavailable to legitimate users.


> User to Root (U2R): The attacker starts with access to a normal user account on the system and gains root access.
Common programming mistakes and environment assumptions allow attackers to exploit vulnerabilities that
lead to root access.


> Remote to User (R2L): The attacker sends packets to a machine over a network and exploits the machine’s
vulnerability to illegally gain local access as a user. There are different types of R2L attacks, and the most
common attack in this class is made using social engineering.


> Probing: It is a class of attacks where an attacker scans a network to gather information and find known
vulnerabilities. An attacker with a map of the machines and services available on a network can use the
information to look for exploits. Different probes exist; some abuse the computer’s legitimate features, and
some use social engineering techniques.


Table 5 presents the attacks of each class that are most common in the analysis of the KddCup99 dataset.


Table 5. Various attacks on KddCup99 Dataset


| Attack | Type | Mechanism | Attack Effect |
|---|---|---|---|
| back | DoS | Abuse/Bug | Slows down server response |
| land | DoS | Bug | Slows down server response |
| neptune | DoS | Abuse | Slows down server response |
| smurf | DoS | Abuse | Slows down the network |
| pod | DoS | Abuse | Slows down server response |
| teardrop | DoS | Bug | Reboots the machine |
| loadmodule | U2R | Poor environment sanitation | Gains root shell |
| buffer_overflow | U2R | Abuse | Gains root shell |
| rootkit | U2R | Abuse | Gains root shell |
| perl | U2R | Poor environment sanitation | Gains root shell |
| phf | R2L | Bug | Executes commands as root |
| guess_passwd | R2L | Login misconfiguration | Gains user access |
| warezmaster | R2L | Abuse | Gains user access |
| imap | R2L | Bug | Gains root access |
| multihop | R2L | Abuse | Gains root access |
| ftp_write | R2L | Misconfiguration | Gains user access |
| spy | R2L | Abuse | Gains user access |
| warezclient | R2L | Abuse | Gains user access |
| satan | Probe | Abuse of feature | Looks for known vulnerabilities |
| nmap | Probe | Abuse of feature | Identifies active ports on a machine |
| portsweep | Probe | Abuse of feature | Identifies active ports on a machine |
| ipsweep | Probe | Abuse of feature | Identifies active machines |

The KDD Cup99 dataset contains normal traffic and 22 attack types, each record described by 41 features, and Table 6
shows two sample records. Every generated traffic pattern ends with a label, either ‘normal’ or the name of an
attack, for the subsequent analysis.


Table 6. Sample Data Packets

Feature Name Packet-1 (Normal) Packet-2 (Neptune)
duration 0 0
protocol_type TCP TCP
service HTTP private
flag SF REJ
src_bytes 327 0
dst_bytes 467 0
land 0 0
wrong_fragment 0 0 
urgent 0 0 
hot 0 0
num_failed_logins 0 0 
logged_in 1 0 
num_compromised 0 0 
root_shell 0 0 
su_attempted 0 0 
num_root 0 0 
num_file_creations 0 0
num_shells 0 0 
num_access_files 0 0 
num_outbound_cmds 0 0 
is_hot_login 0 0 
is_guest_login 0 0 
count 33 136 
srv_count 47 1 
serror_rate 0 0 
srv_serror_rate 0 0 
rerror_rate 0 1 
srv_rerror_rate 0 1 
same_srv_rate 1 0.01 
diff_srv_rate 0 0.06 
srv_diff_host_rate 0.04 0 
dst_host_count 151 255 
dst_host_srv_count 255 1 
dst_host_same_srv_rate 1 0 
dst_host_diff_srv_rate 0 0.06
dst_host_same_src_port_rate 0.01 0 
dst_host_srv_diff_host_rate 0.03 0 
dst_host_serror_rate 0 0 
dst_host_srv_serror_rate 0 0 
dst_host_rerror_rate 0 1 
dst_host_srv_rerror_rate 0 1 
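
A record such as those in Table 6 is stored as one comma-separated line of 41 feature values followed by its label. The sketch below parses such a line and derives the binary target used in this paper (‘0’ for normal, ‘1’ for attack). The function name `parse_record` is hypothetical, the feature names follow Table 2, and the sample line is rebuilt from the Packet-2 column of Table 6.

```python
FEATURE_NAMES = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_hot_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate",
]
SYMBOLIC = {"protocol_type", "service", "flag"}  # kept as strings


def parse_record(line):
    """Split one CSV record into (feature dict, binary target)."""
    *values, label = line.strip().rstrip(".").split(",")
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} features, got {len(values)}")
    record = {
        name: value if name in SYMBOLIC else float(value)
        for name, value in zip(FEATURE_NAMES, values)
    }
    target = 0 if label == "normal" else 1  # 0 = normal, 1 = attack
    return record, target


# Packet-2 (Neptune) from Table 6, rebuilt field by field.
packet_2 = (
    ["0", "tcp", "private", "REJ"] + ["0"] * 18
    + ["136", "1", "0", "0", "1", "1", "0.01", "0.06", "0"]
    + ["255", "1", "0", "0.06", "0", "0", "0", "0", "1", "1"]
)
record, target = parse_record(",".join(packet_2 + ["neptune."]))
```

In this sample the high count (136 connections to the same host within two seconds) together with a rerror_rate of 1 is the kind of pattern that separates the attack record from the normal one.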


This section outlines the structure of the dataset used by the intrusion detection system. The various kinds of
features, such as discrete and continuous, are studied with a focus on their role in the attack. The attacks are
classified with a brief introduction to each.


IV. CONCLUSION 

Intrusion detection should be a primary priority for
any network administrator. We conducted a thorough yet
simple study to examine different methods for
developing Network Intrusion Detection models.
Several research articles published in various journals
served as the foundation for this
study. Several tables provided in this publication
analyze the KddCup99 dataset’s characteristics. The
many strategies employed by network intrusion
detection systems are described, along with each one’s
benefits and drawbacks. The dataset was also observed
to contain many packets, both normal and attack. The
investigation in this work can be broadened with
several machine learning methods for identifying
attacks on the system. The behavior of the data has
been studied here as a basis for that further research.